Language Analysis

This analysis looks at language usage on the internet over time and across regions within Ukraine. It relies primarily on Google Trends data retrieved with the gtrendsR package. gtrendsR limits the number of queries you can run, and I have not been able to get around the cap.
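Because of the query cap, it helps to run each query once and reuse the saved result. The sketch below shows one way that could be done. `cached_query()` is a hypothetical helper, not part of gtrendsR, and the commented-out `gtrends()` call is only an assumption about how the saved files might have been produced.

```r
# Hypothetical helper: run a query only when no cached copy exists on disk.
# `fetch` is any zero-argument function that returns the query result.
cached_query <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))  # reuse the saved result, no new query is sent
  }
  result <- fetch()
  saveRDS(result, path)
  result
}

# Possible usage with a real query (needs gtrendsR and network access):
# how_raw <- cached_query(
#   "how2010_01_01_2025_05_01.RDS",
#   function() gtrendsR::gtrends(keyword = c("Як", "Как"), geo = "UA",
#                                time = "2010-01-01 2025-05-01")
# )
```

Later runs then read the file instead of spending another query against the cap.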

This analysis leans on revealed preference theory: it assumes that the language people use online is the one they actually prefer, even if they do not use it in public or in other settings. It is partly inspired by Seth Stephens-Davidowitz's book Everybody Lies, which found that people express things online that they likely would not express openly. Still, revealed preference theory has clear limitations in the context of internet usage in Ukraine. One, especially further back in time, is that there was a genuine dearth of Ukrainian-language resources on many topics. Even today, people who primarily speak Ukrainian may search in Russian for topics where few Ukrainian-language sources exist.

Below are initial queries for words that are spelled differently in Russian and Ukrainian, plotted over time with ticks for key events and a trend line.
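The helper functions used in the plots (process_g_trends, wide_dat, time_plot) are not shown in this document. As a rough sketch of the idea only, the code below reshapes invented monthly hits into one column per language, computes the Ukrainian/Russian ratio, and draws it with a trend line and event markers. The data, the column names, and the choice of event dates are all assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Invented monthly hits in long form: one row per date and language
dates <- seq(as.Date("2010-01-01"), as.Date("2025-05-01"), by = "month")
n <- length(dates)
ua_hits <- round(seq(10, 60, length.out = n))  # toy upward trend
ru_hits <- 100 - ua_hits
toy_long <- data.frame(
  date = rep(dates, times = 2),
  lang = rep(c("ua", "ru"), each = n),
  hits = c(ua_hits, ru_hits)
)

# One column per language, then the UA/RU ratio
toy_wide <- toy_long %>%
  pivot_wider(names_from = lang, values_from = hits) %>%
  mutate(ratio = ua / ru)

# Assumed key events: start of Euromaidan and the full-scale invasion
key_events <- as.Date(c("2013-11-21", "2022-02-24"))

ratio_plot <- ggplot(toy_wide, aes(x = date, y = ratio)) +
  geom_line(linewidth = 0.6) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  geom_vline(xintercept = key_events, linetype = "dashed") +
  labs(title = "Toy example: UA/RU ratio over time", x = NULL, y = "UA/RU ratio")
```

Printing `ratio_plot` draws the chart. A ratio above 1 means the Ukrainian spelling was searched more than the Russian one in that month.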

# Each plot reads a saved Google Trends query and charts the Ukrainian/Russian ratio over time
how <- time_plot(wide_dat(process_g_trends(readRDS("C:/UKR-RU-Language-Analysis/Trend Queries/how2010_01_01_2025_05_01.RDS"))[[1]]), "How: Як/как")
News <- time_plot(wide_dat(process_g_trends(readRDS("C:/UKR-RU-Language-Analysis/Trend Queries/news2010_01_01_2025_05_01.rds"))[[1]]), "News: новини/новости")
what <- time_plot(wide_dat(process_g_trends(readRDS("C:/UKR-RU-Language-Analysis/Trend Queries/what2010_01_01_2025_05_01.rds"))[[1]]), "What is: Що таке/Что такое")
# NOTE: price reads the recipes query file, the same file as `recipes` below; the path may need to point to the price query instead
price <- time_plot(wide_dat(process_g_trends(readRDS("C:/UKR-RU-Language-Analysis/Trend Queries/recipes2010_01_01_2025_05_01.RDS"))[[1]]), "Price: Ціна/Цена")
games <- time_plot(wide_dat(process_g_trends(readRDS("C:/UKR-RU-Language-Analysis/Trend Queries/games2010_01_01_2025_05_01.RDS"))[[1]]), "Games: Ігри/Игры")
recipes <- time_plot(wide_dat(process_g_trends(readRDS("C:/UKR-RU-Language-Analysis/Trend Queries/recipes2010_01_01_2025_05_01.RDS"))[[1]]), "Recipes: Рецепти/Рецепты")
price

News

# Stack the "how" and "what" plots vertically (patchwork syntax)
how / what +
  plot_annotation(title = "Graphs of Ukrainian/Russian Ratios over Time")

The table below ranks oblasts and cities by the ratio of Ukrainian to Russian searches for "how" between January 1, 2010 and May 1, 2025.

# Element 3 of the saved query object is used as the by-region results
how_region <- readRDS("C:/UKR-RU-Language-Analysis/Trend Queries/how2010_01_01_2025_05_01.RDS")[[3]]

# Table ranking by region.
# Both subsets are sorted by location so that their rows line up for cbind() below.
reg_ru <- how_region %>%
  filter(keyword == "Как") %>%
  arrange(location) %>%
  select(hits)

reg_ua <- how_region %>%
  filter(keyword == "Як") %>%
  arrange(location) %>%
  select(location, hits)

# Create separate columns for the two search terms
reg_ua <- rename(reg_ua, hits_ua = hits)
reg_ru <- rename(reg_ru, hits_ru = hits)
regions_tab <- cbind(reg_ua, reg_ru)

# Convert hits to numeric, treating "<1" and empty Ukrainian values as 0
regions_tab$hits_ua <- ifelse(regions_tab$hits_ua %in% c("<1", ""), 0, regions_tab$hits_ua)
regions_tab$hits_ua <- as.numeric(regions_tab$hits_ua)
regions_tab$hits_ru <- as.numeric(regions_tab$hits_ru)

# Ratio of Ukrainian to Russian hits, sorted from highest to lowest
regions_tab$rat <- regions_tab$hits_ua / regions_tab$hits_ru
regions_tab <- arrange(regions_tab, desc(rat))
names(regions_tab) <- c("Oblast", "Hits UA", "Hits RU", "Ratio UA/RU")

regions_tab %>%
  gt() %>%
  tab_header(
    title = "Ukrainian Oblasts and Cities Ranked by Ukrainian-to-Russian Search Ratio",
    subtitle = "Search term: 'how' ('Як' vs 'Как') from Google Trends"
  ) %>%
  fmt_number(columns = c(`Hits UA`, `Hits RU`, `Ratio UA/RU`), decimals = 2) %>%
  tab_footnote(
    footnote = "Crimea and Sevastopol may have suppressed or missing data.",
    locations = cells_title(groups = "title")
  ) %>%
  cols_label(
    `Hits UA` = "Hits (UA)",
    `Hits RU` = "Hits (RU)",
    `Ratio UA/RU` = "UA/RU Ratio"
  ) %>%
  opt_table_outline()
Ukrainian Oblasts and Cities Ranked by Ukrainian-to-Russian Search Ratio¹
Search term: 'how' ('Як' vs 'Как') from Google Trends

| Oblast | Hits (UA) | Hits (RU) | UA/RU Ratio |
|---|---|---|---|
| Ternopil's'ka oblast | 98.00 | 19.00 | 5.16 |
| Ivano-Frankivs'ka oblast | 100.00 | 21.00 | 4.76 |
| Volyns'ka oblast | 94.00 | 24.00 | 3.92 |
| Rivnens'ka oblast | 85.00 | 26.00 | 3.27 |
| Lviv Oblast | 79.00 | 26.00 | 3.04 |
| Khmel'nyts'ka oblast | 84.00 | 35.00 | 2.40 |
| Zakarpats'ka oblast | 66.00 | 35.00 | 1.89 |
| Vinnyts'ka oblast | 76.00 | 42.00 | 1.81 |
| Chernivets'ka oblast | 67.00 | 38.00 | 1.76 |
| Zhytomyrs'ka oblast | 64.00 | 49.00 | 1.31 |
| Cherkas'ka oblast | 60.00 | 50.00 | 1.20 |
| Kyivs'ka oblast | 51.00 | 50.00 | 1.02 |
| Poltavs'ka oblast | 43.00 | 55.00 | 0.78 |
| Chernihivs'ka oblast | 44.00 | 60.00 | 0.73 |
| Kirovohrads'ka oblast | 44.00 | 74.00 | 0.59 |
| Sums'ka oblast | 36.00 | 65.00 | 0.55 |
| Kyiv city | 28.00 | 59.00 | 0.47 |
| Mykolaivs'ka oblast | 27.00 | 72.00 | 0.38 |
| Dnipropetrovsk Oblast | 20.00 | 68.00 | 0.29 |
| Khersons'ka oblast | 17.00 | 79.00 | 0.22 |
| Zaporiz'ka oblast | 15.00 | 81.00 | 0.19 |
| Odessa Oblast | 15.00 | 82.00 | 0.18 |
| Kharkiv Oblast | 14.00 | 79.00 | 0.18 |
| Donetsk Oblast | 6.00 | 84.00 | 0.07 |
| Luhans'ka oblast | 4.00 | 91.00 | 0.04 |
| Crimea | 0.00 | 100.00 | 0.00 |
| Sevastopol' city | 0.00 | 94.00 | 0.00 |

¹ Crimea and Sevastopol may have suppressed or missing data.
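The table code pairs the Ukrainian and Russian rows by position after sorting both by location, which only works if every location appears once for each keyword. A possible alternative, sketched below on toy data, is to join the two subsets on the location key. The object names and the `to_num()` helper are hypothetical, and the raw "<1" value for Crimea is an assumption about how the zero in the table arose.

```r
library(dplyr)

# Toy data shaped like the by-region results: one row per location and keyword.
# Hits are stored as strings because Google Trends can report "<1".
toy_region <- data.frame(
  location = rep(c("Crimea", "Kyiv city", "Lviv Oblast"), each = 2),
  keyword  = rep(c("Як", "Как"), times = 3),
  hits     = c("<1", "100", "28", "59", "79", "26"),
  stringsAsFactors = FALSE
)

# Convert hit strings to numbers, treating "<1" and "" as 0
to_num <- function(x) as.numeric(ifelse(x %in% c("<1", ""), 0, x))

ua <- toy_region %>%
  filter(keyword == "Як") %>%
  transmute(location, hits_ua = to_num(hits))
ru <- toy_region %>%
  filter(keyword == "Как") %>%
  transmute(location, hits_ru = to_num(hits))

# Join on the location key instead of relying on row order
ranked <- inner_join(ua, ru, by = "location") %>%
  mutate(rat = hits_ua / hits_ru) %>%
  arrange(desc(rat))
```

With a keyed join, a location missing from one subset drops out of the result instead of silently shifting every row below it.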
# Rebuild the regional table for the map (sorted by location, original column names)
how_region <- readRDS("C:/UKR-RU-Language-Analysis/Trend Queries/how2010_01_01_2025_05_01.RDS")[[3]]

# Both subsets are sorted by location so that their rows line up for cbind() below
reg_ru <- how_region %>%
  filter(keyword == "Как") %>%
  arrange(location) %>%
  select(hits)

reg_ua <- how_region %>%
  filter(keyword == "Як") %>%
  arrange(location) %>%
  select(location, hits)

# Create separate columns for the two search terms
reg_ua <- rename(reg_ua, hits_ua = hits)
reg_ru <- rename(reg_ru, hits_ru = hits)

regions_tab <- cbind(reg_ua, reg_ru)
# Convert hits to numeric, treating "<1" and empty Ukrainian values as 0
regions_tab$hits_ua <- ifelse(regions_tab$hits_ua %in% c("<1", ""), 0, regions_tab$hits_ua)
regions_tab$hits_ua <- as.numeric(regions_tab$hits_ua)
regions_tab$hits_ru <- as.numeric(regions_tab$hits_ru)
regions_tab$rat <- regions_tab$hits_ua / regions_tab$hits_ru
ukraine_sf <- st_read("C:/UKR-RU-Language-Analysis/Shapefiles/gadm41_UKR_1.shp")
Reading layer `gadm41_UKR_1' from data source 
  `C:\UKR-RU-Language-Analysis\Shapefiles\gadm41_UKR_1.shp' using driver `ESRI Shapefile'
Simple feature collection with 28 features and 11 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 22.14045 ymin: 44.38597 xmax: 40.21807 ymax: 52.37503
Geodetic CRS:  WGS 84
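# Standardize names: lookup table with the shapefile name for each region
nam <- read.csv("C:/UKR-RU-Language-Analysis/Shapefiles/name_lookup.csv")

# Attach the shapefile names; this assumes the lookup rows are in the same order as regions_tab
regions_tab <- cbind(regions_tab, shapefile_name = nam$shapefile_name)

map_data <- ukraine_sf %>%
  left_join(regions_tab, by = c("NAME_1" = "shapefile_name"))

# Kyiv city fails the join for some reason, so its ratio is set manually and the feature renamed.
# The row indices are hard-coded and depend on the current row order.
map_data$rat[13] <- regions_tab$rat[12]
map_data$NAME_1[13] <- "Kyiv City"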
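I have not tracked down why the Kyiv city row fails to join. One way to diagnose a mismatch like this is an anti-join in each direction, which lists the names that have no partner on the other side. The sketch below uses invented miniature tables. The spelling difference in it is an assumption for illustration, not the confirmed cause in the real data.

```r
library(dplyr)

# Invented miniature versions of the two tables (not the real data)
toy_shapes <- data.frame(
  NAME_1 = c("Kiev City", "Lviv", "Odessa"),
  stringsAsFactors = FALSE
)
toy_ratios <- data.frame(
  shapefile_name = c("Kyiv City", "Lviv", "Odessa"),
  rat = c(0.47, 3.04, 0.18),
  stringsAsFactors = FALSE
)

# Shapefile names with no matching row in the ratio table
unmatched_shapes <- anti_join(toy_shapes, toy_ratios,
                              by = c("NAME_1" = "shapefile_name"))

# Ratio-table names with no matching shapefile feature
unmatched_ratios <- anti_join(toy_ratios, toy_shapes,
                              by = c("shapefile_name" = "NAME_1"))
```

If both results are empty, every name matched. With the real sf object, dropping the geometry first with sf::st_drop_geometry() keeps the output compact.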

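# Fixed class breaks for the ratio; nine break points define eight intervals,
# so there is one label per interval
breaks <- c(0, 0.1, 0.25, 0.5, 1, 2, 4, 10, Inf)
labels <- c("<0.1", "0.1–0.25", "0.25–0.5", "0.5–1", "1–2", "2–4", "4–10", ">10")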
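To see how fixed breaks sort ratios into classes, the sketch below applies base R's cut() to four ratios taken from the table. It assumes left-closed intervals, so a ratio of exactly 1 falls in the 1 to 2 class. The names `ratio_breaks`, `interval_labels`, and `example_ratios` are hypothetical and used only here.

```r
# Nine break points define eight intervals, so eight labels are needed
ratio_breaks <- c(0, 0.1, 0.25, 0.5, 1, 2, 4, 10, Inf)
interval_labels <- c("<0.1", "0.1-0.25", "0.25-0.5", "0.5-1",
                     "1-2", "2-4", "4-10", ">10")
stopifnot(length(interval_labels) == length(ratio_breaks) - 1)

# Four ratios from the table: Ternopil, Kyiv oblast, Odessa, Crimea
example_ratios <- c(5.16, 1.02, 0.18, 0)

# right = FALSE makes each interval closed on the left: [lower, upper)
classes <- cut(example_ratios, breaks = ratio_breaks,
               labels = interval_labels, right = FALSE,
               include.lowest = TRUE)
as.character(classes)
```

The length check at the top guards against a label vector that does not match the number of intervals.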

tmap_mode("plot")
# Static map; ratios are binned into the fixed intervals defined in `breaks`
tm_shape(map_data) +
  tm_polygons(
    fill = "rat",
    fill.scale = tm_scale_intervals(style = "fixed", breaks = breaks,
                                    values = "brewer.rd_yl_gn"),
    fill.legend = tm_legend(title = "UA/RU Language Ratio")
  ) +
  tm_title("Ukrainian-to-Russian Search Ratio by Oblast") +
  tm_layout(legend.outside = TRUE)

tmap_mode("view")  # Enables interactive mode
tm_shape(map_data) +
  tm_polygons(
    fill = "rat",
    fill.scale = tm_scale_intervals(style = "fixed", breaks = breaks,
                                    values = "brewer.rd_yl_gn"),
    fill.legend = tm_legend(title = "UA/RU Language Ratio"),
    popup.vars = c("Oblast" = "NAME_1", "Ratio" = "rat")
  ) +
  tm_title("Interactive Ukrainian-to-Russian Search Ratio Map")
